Hierarchical Clustering

Author: Nico Kuijpers
Date: March 28, 2021

Updated by Jacco Snoeren (July 2023)

Introduction

Hierarchical clustering is one of the machine learning algorithms that can be applied for unsupervised learning. In this notebook we give an example of how to apply agglomerative clustering on the Iris dataset.

First import the libraries we need.

In [ ]:
import numpy as np
import pandas as pd
import sklearn as sk
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)
print('scikit-learn version:', sk.__version__)
print('seaborn version:', sns.__version__)
print('matplotlib version:', matplotlib.__version__)

%matplotlib inline
numpy version: 1.26.4
pandas version: 2.2.1
scikit-learn version: 1.4.1.post1
seaborn version: 0.13.2
matplotlib version: 3.8.3

📦 Data provisioning

To illustrate hierarchical clustering we use the Iris dataset. The dataset consists of 150 entries, 4 input features, and 1 output label. It contains 50 samples from each of three species of Iris: Iris setosa, Iris versicolor, and Iris virginica. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

During clustering, we ignore the labels (unsupervised learning). We can compare the results of clustering with the labels afterwards.
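
One simple way to make that comparison is a contingency table of species against cluster IDs. The sketch below uses pandas.crosstab on a handful of made-up labels (the Series `species` and `cluster_ids` are hypothetical and are not results from this notebook). Cluster IDs are arbitrary, so only the grouping matters, not the numbers themselves.

```python
import pandas as pd

# Hypothetical true labels and cluster assignments for six flowers
species = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
     "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
    name="Species",
)
cluster_ids = pd.Series([1, 1, 0, 0, 0, 2], name="Cluster")

# Rows are species, columns are cluster IDs, cells count the flowers
table = pd.crosstab(species, cluster_ids)
print(table)
```

A species whose flowers all fall into a single column was recovered cleanly; a row spread over several columns indicates that the clustering mixed that species with another one.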

For more information on the Iris dataset, see https://en.wikipedia.org/wiki/Iris_flower_data_set

In [ ]:
# Download the Iris dataset from the internet
columns = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Species"]
df_iris = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=columns)

📃 Sample the data

Get a first impression of the dataset by printing its shape and showing the first 5 and last 5 rows of the DataFrame.

In [ ]:
# Explore the Iris dataset
df_iris.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']
print('Iris dataset shape: {}'.format(df_iris.shape))
df_iris.head(5)
Iris dataset shape: (150, 5)
Out[ ]:
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
In [ ]:
df_iris.tail(5)
Out[ ]:
Sepal Length Sepal Width Petal Length Petal Width Species
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

Print the different species in the dataset.

In [ ]:
print(df_iris['Species'].unique())
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

Print the number of flowers for each species and visualize these numbers using a bar plot.

In [ ]:
print(df_iris['Species'].value_counts())
df_iris['Species'].value_counts().plot(kind='bar')
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
Out[ ]:
<Axes: xlabel='Species'>

Preprocessing

Method pandas.DataFrame.info() prints information about a DataFrame including the index dtype and columns, non-null values, and memory usage. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html.

In [ ]:
df_iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal Length  150 non-null    float64
 1   Sepal Width   150 non-null    float64
 2   Petal Length  150 non-null    float64
 3   Petal Width   150 non-null    float64
 4   Species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Method pandas.DataFrame.describe() generates descriptive statistics. These include central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe

In [ ]:
df_iris.describe()
Out[ ]:
Sepal Length Sepal Width Petal Length Petal Width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Analyse the dataset using a box-and-whisker plot generated by method pandas.DataFrame.boxplot(). See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html

For more information on box plots, see https://en.wikipedia.org/wiki/Box_plot

From the box plot below, it can be observed that for Sepal Length and Sepal Width, there is some overlap in values for the three different species. Petal Length and Petal Width show less overlap. This information may be useful when selecting features.
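
The overlap can also be checked numerically by looking at the value range per species. The sketch below does this for a small excerpt of nine rows copied from the dataset (the DataFrame `df_excerpt` is an assumption made for illustration); the same groupby call could be applied to df_iris to get the ranges for the full data.

```python
import pandas as pd

# Three rows per species, values copied from the Iris dataset
df_excerpt = pd.DataFrame({
    "Sepal Length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.7, 6.3, 6.5],
    "Petal Length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 5.2, 5.0, 5.2],
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
})

# Minimum and maximum of each feature per species
ranges = df_excerpt.groupby("Species")[["Sepal Length", "Petal Length"]].agg(["min", "max"])
print(ranges)
```

Ranges that do not overlap between species point to a feature that separates them well; overlapping ranges point to a feature that is less useful on its own.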

Note: by default, the box plot will be partly shown and a scroll bar appears. To view the entire box plot, select Cell → All Output → Toggle Scrolling.

In [ ]:
iris_features = tuple(df_iris.columns[:4].values)
df_iris.boxplot(column=iris_features, by='Species', figsize=(15,8), layout=(1,4));

Plot pairwise relationships using method seaborn.pairplot. See https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [ ]:
# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df_iris, hue='Species')
plt.show()

💡 Feature selection

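All 4 features can be used for clustering; the cases defined below also try the sepal features and the petal features separately. From the box plots it can be observed that the values range between 0 and 8 cm and that the distribution differs per feature. For instance, Sepal Length ranges between 4 and 8 cm, while Petal Width ranges between 0 and 3 cm. Agglomerative clustering, like K-means, is based on distances between samples, so features with a larger range dominate the result unless the data is normalized. Using the StandardScaler, the standard score of a sample $x$ is calculated as $z=(x-u)/s$, where $u$ is the mean and $s$ is the standard deviation. Normalization is switched on per case with the 'normalize' flag, which is set to False for all three cases below.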
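
As a worked example of the formula, the sketch below computes the standard score by hand with NumPy for six sepal length values taken from the dataset. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses; the variable names are illustrative.

```python
import numpy as np

# Six sepal length values (cm) taken from the Iris dataset
x = np.array([5.1, 4.9, 7.0, 6.4, 6.3, 5.8])

u = x.mean()      # mean
s = x.std()       # population standard deviation (ddof=0)
z = (x - u) / s   # standard score z = (x - u) / s

print(z.round(3))
print("mean of z:", z.mean().round(6), "std of z:", z.std().round(6))
```

After the transformation the values have mean 0 and standard deviation 1, so every feature contributes on the same scale to a distance computation.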

In [ ]:
from sklearn.preprocessing import StandardScaler

# Define X_iris and y_iris
X_iris = df_iris[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']]
y_iris = df_iris['Species']

# Define the feature subset and the normalization flag for each case
case_info = [
    {'features': ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'], 'normalize': False},
    {'features': ['Sepal Length', 'Sepal Width'], 'normalize': False},
    {'features': ['Petal Length', 'Petal Width'], 'normalize': False}
]

# Initialize an empty list to store the resulting arrays
resulting_arrays = []

# Iterate over each case
for info in case_info:
    selected_features = info['features']
    X_iris_case = df_iris[selected_features]

    # Normalize the data if needed
    if info['normalize']:
        scaler_iris = StandardScaler().fit(X_iris_case)
        X_iris_normalized = scaler_iris.transform(X_iris_case)
    else:
        X_iris_normalized = X_iris_case.to_numpy()

    # Append the resulting array to the list
    resulting_arrays.append(X_iris_normalized)

for i, array in enumerate(resulting_arrays):
    print(f"Array for case {i}:\n{array}\n")
Array for case 0:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.9 1.5]
 [5.5 2.3 4.  1.3]
 [6.5 2.8 4.6 1.5]
 [5.7 2.8 4.5 1.3]
 [6.3 3.3 4.7 1.6]
 [4.9 2.4 3.3 1. ]
 [6.6 2.9 4.6 1.3]
 [5.2 2.7 3.9 1.4]
 [5.  2.  3.5 1. ]
 [5.9 3.  4.2 1.5]
 [6.  2.2 4.  1. ]
 [6.1 2.9 4.7 1.4]
 [5.6 2.9 3.6 1.3]
 [6.7 3.1 4.4 1.4]
 [5.6 3.  4.5 1.5]
 [5.8 2.7 4.1 1. ]
 [6.2 2.2 4.5 1.5]
 [5.6 2.5 3.9 1.1]
 [5.9 3.2 4.8 1.8]
 [6.1 2.8 4.  1.3]
 [6.3 2.5 4.9 1.5]
 [6.1 2.8 4.7 1.2]
 [6.4 2.9 4.3 1.3]
 [6.6 3.  4.4 1.4]
 [6.8 2.8 4.8 1.4]
 [6.7 3.  5.  1.7]
 [6.  2.9 4.5 1.5]
 [5.7 2.6 3.5 1. ]
 [5.5 2.4 3.8 1.1]
 [5.5 2.4 3.7 1. ]
 [5.8 2.7 3.9 1.2]
 [6.  2.7 5.1 1.6]
 [5.4 3.  4.5 1.5]
 [6.  3.4 4.5 1.6]
 [6.7 3.1 4.7 1.5]
 [6.3 2.3 4.4 1.3]
 [5.6 3.  4.1 1.3]
 [5.5 2.5 4.  1.3]
 [5.5 2.6 4.4 1.2]
 [6.1 3.  4.6 1.4]
 [5.8 2.6 4.  1.2]
 [5.  2.3 3.3 1. ]
 [5.6 2.7 4.2 1.3]
 [5.7 3.  4.2 1.2]
 [5.7 2.9 4.2 1.3]
 [6.2 2.9 4.3 1.3]
 [5.1 2.5 3.  1.1]
 [5.7 2.8 4.1 1.3]
 [6.3 3.3 6.  2.5]
 [5.8 2.7 5.1 1.9]
 [7.1 3.  5.9 2.1]
 [6.3 2.9 5.6 1.8]
 [6.5 3.  5.8 2.2]
 [7.6 3.  6.6 2.1]
 [4.9 2.5 4.5 1.7]
 [7.3 2.9 6.3 1.8]
 [6.7 2.5 5.8 1.8]
 [7.2 3.6 6.1 2.5]
 [6.5 3.2 5.1 2. ]
 [6.4 2.7 5.3 1.9]
 [6.8 3.  5.5 2.1]
 [5.7 2.5 5.  2. ]
 [5.8 2.8 5.1 2.4]
 [6.4 3.2 5.3 2.3]
 [6.5 3.  5.5 1.8]
 [7.7 3.8 6.7 2.2]
 [7.7 2.6 6.9 2.3]
 [6.  2.2 5.  1.5]
 [6.9 3.2 5.7 2.3]
 [5.6 2.8 4.9 2. ]
 [7.7 2.8 6.7 2. ]
 [6.3 2.7 4.9 1.8]
 [6.7 3.3 5.7 2.1]
 [7.2 3.2 6.  1.8]
 [6.2 2.8 4.8 1.8]
 [6.1 3.  4.9 1.8]
 [6.4 2.8 5.6 2.1]
 [7.2 3.  5.8 1.6]
 [7.4 2.8 6.1 1.9]
 [7.9 3.8 6.4 2. ]
 [6.4 2.8 5.6 2.2]
 [6.3 2.8 5.1 1.5]
 [6.1 2.6 5.6 1.4]
 [7.7 3.  6.1 2.3]
 [6.3 3.4 5.6 2.4]
 [6.4 3.1 5.5 1.8]
 [6.  3.  4.8 1.8]
 [6.9 3.1 5.4 2.1]
 [6.7 3.1 5.6 2.4]
 [6.9 3.1 5.1 2.3]
 [5.8 2.7 5.1 1.9]
 [6.8 3.2 5.9 2.3]
 [6.7 3.3 5.7 2.5]
 [6.7 3.  5.2 2.3]
 [6.3 2.5 5.  1.9]
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]

Array for case 1:
[[5.1 3.5]
 [4.9 3. ]
 [4.7 3.2]
 [4.6 3.1]
 [5.  3.6]
 [5.4 3.9]
 [4.6 3.4]
 [5.  3.4]
 [4.4 2.9]
 [4.9 3.1]
 [5.4 3.7]
 [4.8 3.4]
 [4.8 3. ]
 [4.3 3. ]
 [5.8 4. ]
 [5.7 4.4]
 [5.4 3.9]
 [5.1 3.5]
 [5.7 3.8]
 [5.1 3.8]
 [5.4 3.4]
 [5.1 3.7]
 [4.6 3.6]
 [5.1 3.3]
 [4.8 3.4]
 [5.  3. ]
 [5.  3.4]
 [5.2 3.5]
 [5.2 3.4]
 [4.7 3.2]
 [4.8 3.1]
 [5.4 3.4]
 [5.2 4.1]
 [5.5 4.2]
 [4.9 3.1]
 [5.  3.2]
 [5.5 3.5]
 [4.9 3.1]
 [4.4 3. ]
 [5.1 3.4]
 [5.  3.5]
 [4.5 2.3]
 [4.4 3.2]
 [5.  3.5]
 [5.1 3.8]
 [4.8 3. ]
 [5.1 3.8]
 [4.6 3.2]
 [5.3 3.7]
 [5.  3.3]
 [7.  3.2]
 [6.4 3.2]
 [6.9 3.1]
 [5.5 2.3]
 [6.5 2.8]
 [5.7 2.8]
 [6.3 3.3]
 [4.9 2.4]
 [6.6 2.9]
 [5.2 2.7]
 [5.  2. ]
 [5.9 3. ]
 [6.  2.2]
 [6.1 2.9]
 [5.6 2.9]
 [6.7 3.1]
 [5.6 3. ]
 [5.8 2.7]
 [6.2 2.2]
 [5.6 2.5]
 [5.9 3.2]
 [6.1 2.8]
 [6.3 2.5]
 [6.1 2.8]
 [6.4 2.9]
 [6.6 3. ]
 [6.8 2.8]
 [6.7 3. ]
 [6.  2.9]
 [5.7 2.6]
 [5.5 2.4]
 [5.5 2.4]
 [5.8 2.7]
 [6.  2.7]
 [5.4 3. ]
 [6.  3.4]
 [6.7 3.1]
 [6.3 2.3]
 [5.6 3. ]
 [5.5 2.5]
 [5.5 2.6]
 [6.1 3. ]
 [5.8 2.6]
 [5.  2.3]
 [5.6 2.7]
 [5.7 3. ]
 [5.7 2.9]
 [6.2 2.9]
 [5.1 2.5]
 [5.7 2.8]
 [6.3 3.3]
 [5.8 2.7]
 [7.1 3. ]
 [6.3 2.9]
 [6.5 3. ]
 [7.6 3. ]
 [4.9 2.5]
 [7.3 2.9]
 [6.7 2.5]
 [7.2 3.6]
 [6.5 3.2]
 [6.4 2.7]
 [6.8 3. ]
 [5.7 2.5]
 [5.8 2.8]
 [6.4 3.2]
 [6.5 3. ]
 [7.7 3.8]
 [7.7 2.6]
 [6.  2.2]
 [6.9 3.2]
 [5.6 2.8]
 [7.7 2.8]
 [6.3 2.7]
 [6.7 3.3]
 [7.2 3.2]
 [6.2 2.8]
 [6.1 3. ]
 [6.4 2.8]
 [7.2 3. ]
 [7.4 2.8]
 [7.9 3.8]
 [6.4 2.8]
 [6.3 2.8]
 [6.1 2.6]
 [7.7 3. ]
 [6.3 3.4]
 [6.4 3.1]
 [6.  3. ]
 [6.9 3.1]
 [6.7 3.1]
 [6.9 3.1]
 [5.8 2.7]
 [6.8 3.2]
 [6.7 3.3]
 [6.7 3. ]
 [6.3 2.5]
 [6.5 3. ]
 [6.2 3.4]
 [5.9 3. ]]

Array for case 2:
[[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]
 [1.7 0.4]
 [1.4 0.3]
 [1.5 0.2]
 [1.4 0.2]
 [1.5 0.1]
 [1.5 0.2]
 [1.6 0.2]
 [1.4 0.1]
 [1.1 0.1]
 [1.2 0.2]
 [1.5 0.4]
 [1.3 0.4]
 [1.4 0.3]
 [1.7 0.3]
 [1.5 0.3]
 [1.7 0.2]
 [1.5 0.4]
 [1.  0.2]
 [1.7 0.5]
 [1.9 0.2]
 [1.6 0.2]
 [1.6 0.4]
 [1.5 0.2]
 [1.4 0.2]
 [1.6 0.2]
 [1.6 0.2]
 [1.5 0.4]
 [1.5 0.1]
 [1.4 0.2]
 [1.5 0.1]
 [1.2 0.2]
 [1.3 0.2]
 [1.5 0.1]
 [1.3 0.2]
 [1.5 0.2]
 [1.3 0.3]
 [1.3 0.3]
 [1.3 0.2]
 [1.6 0.6]
 [1.9 0.4]
 [1.4 0.3]
 [1.6 0.2]
 [1.4 0.2]
 [1.5 0.2]
 [1.4 0.2]
 [4.7 1.4]
 [4.5 1.5]
 [4.9 1.5]
 [4.  1.3]
 [4.6 1.5]
 [4.5 1.3]
 [4.7 1.6]
 [3.3 1. ]
 [4.6 1.3]
 [3.9 1.4]
 [3.5 1. ]
 [4.2 1.5]
 [4.  1. ]
 [4.7 1.4]
 [3.6 1.3]
 [4.4 1.4]
 [4.5 1.5]
 [4.1 1. ]
 [4.5 1.5]
 [3.9 1.1]
 [4.8 1.8]
 [4.  1.3]
 [4.9 1.5]
 [4.7 1.2]
 [4.3 1.3]
 [4.4 1.4]
 [4.8 1.4]
 [5.  1.7]
 [4.5 1.5]
 [3.5 1. ]
 [3.8 1.1]
 [3.7 1. ]
 [3.9 1.2]
 [5.1 1.6]
 [4.5 1.5]
 [4.5 1.6]
 [4.7 1.5]
 [4.4 1.3]
 [4.1 1.3]
 [4.  1.3]
 [4.4 1.2]
 [4.6 1.4]
 [4.  1.2]
 [3.3 1. ]
 [4.2 1.3]
 [4.2 1.2]
 [4.2 1.3]
 [4.3 1.3]
 [3.  1.1]
 [4.1 1.3]
 [6.  2.5]
 [5.1 1.9]
 [5.9 2.1]
 [5.6 1.8]
 [5.8 2.2]
 [6.6 2.1]
 [4.5 1.7]
 [6.3 1.8]
 [5.8 1.8]
 [6.1 2.5]
 [5.1 2. ]
 [5.3 1.9]
 [5.5 2.1]
 [5.  2. ]
 [5.1 2.4]
 [5.3 2.3]
 [5.5 1.8]
 [6.7 2.2]
 [6.9 2.3]
 [5.  1.5]
 [5.7 2.3]
 [4.9 2. ]
 [6.7 2. ]
 [4.9 1.8]
 [5.7 2.1]
 [6.  1.8]
 [4.8 1.8]
 [4.9 1.8]
 [5.6 2.1]
 [5.8 1.6]
 [6.1 1.9]
 [6.4 2. ]
 [5.6 2.2]
 [5.1 1.5]
 [5.6 1.4]
 [6.1 2.3]
 [5.6 2.4]
 [5.5 1.8]
 [4.8 1.8]
 [5.4 2.1]
 [5.6 2.4]
 [5.1 2.3]
 [5.1 1.9]
 [5.9 2.3]
 [5.7 2.5]
 [5.2 2.3]
 [5.  1.9]
 [5.2 2. ]
 [5.4 2.3]
 [5.1 1.8]]

Visualize the distribution of the data per feature for each case. Because 'normalize' is False for all cases, the histograms show the original values in centimeters.

In [ ]:
# Define number of bins for histogram
nbins = 16
for i, array in enumerate(resulting_arrays):

    # Number of selected features
    nfeatures = len(case_info[i]['features'])
    print(f"Number of selected features for case {i}: {nfeatures}")

    # Plot histograms for each of the selected features
    fig, axs = plt.subplots(1, nfeatures, figsize=(nfeatures * 4, 6))
    for feature in range(nfeatures):
        print(feature)
        axs[feature].hist(array[:, feature], nbins)
        axs[feature].set_title(case_info[i]['features'][feature])  # Use the correct feature name

    plt.show()
Number of selected features for case 0: 4
0
1
2
3
Number of selected features for case 1: 2
0
1
Number of selected features for case 2: 2
0
1

🪓 Splitting into train/test

Note that we do not split the data into train and test sets. Clustering is (primarily) unsupervised: there are no labels to predict, so the model is fitted on the complete dataset.

Modelling

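To perform hierarchical clustering, we use sklearn.cluster.AgglomerativeClustering. For comparison, K-means (sklearn.cluster.KMeans) and DBSCAN (sklearn.cluster.DBSCAN) are fitted on standardized versions of the same feature sets.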

See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
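
Before applying the estimator to the Iris data, it may help to see it on a toy problem. The sketch below fits AgglomerativeClustering with a distance threshold on six hand-picked points that form two obvious groups (the array `X_toy` and the threshold value 5 are assumptions chosen for illustration). With n_clusters=None, the number of clusters follows from the threshold.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of three points each (illustrative data)
X_toy = np.array([[0, 0], [0, 1], [1, 0],
                  [10, 10], [10, 11], [11, 10]], dtype=float)

# Merging stops once the linkage distance reaches the threshold
model = AgglomerativeClustering(distance_threshold=5, n_clusters=None)
model.fit(X_toy)

print("Number of clusters:", model.n_clusters_)
print("Labels:", model.labels_)
```

The points within each group are merged at small distances, while the final merge of the two groups would require a much larger distance than the threshold, so two clusters remain.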

In [ ]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

agglom = []
for i, array in enumerate(resulting_arrays):
    # Define number of clusters by setting distance threshold
    agglom_append = AgglomerativeClustering(distance_threshold=10, n_clusters=None)

    # Fit on the array for this case (normalized only if 'normalize' is True)
    agglom_append.fit(array)

    agglom.append(agglom_append)

kmean_array = []

for i, array in enumerate(resulting_arrays):

    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(array)

    kmeans = KMeans(n_clusters=3)
    kmeans.fit(scaled_data)
    
    kmean_array.append(kmeans)

DBSCAN_array = []

for i, array in enumerate(resulting_arrays):

    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(array)

    dbscan = DBSCAN(eps=0.45)
    dbscan.fit(scaled_data)  

    DBSCAN_array.append(dbscan)

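Number of clusters found by each model. For agglomerative clustering, n_clusters_ follows from distance_threshold; if distance_threshold=None, it equals the given n_clusters. For K-means, n_clusters is simply the value passed to the constructor. The first three lines of output belong to the agglomerative models, the last three to the K-means models.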

In [ ]:
for agg in agglom:
    print('Number of clusters: ', agg.n_clusters_)

for kmean in kmean_array:
    print('Number of clusters: ', kmean.n_clusters)
Number of clusters:  3
Number of clusters:  2
Number of clusters:  3
Number of clusters:  3
Number of clusters:  3
Number of clusters:  3
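
The relation between distance_threshold and the number of clusters can be checked by hand: a merge whose distance is at or above the threshold is not carried out, and every such merge leaves one extra cluster. The sketch below applies this to the five largest merge distances that the first agglomerative model reports in its distances_ attribute (printed further below); all smaller distances lie under the threshold and do not affect the count.

```python
import numpy as np

# Five largest merge distances of the first case, copied from distances_
largest_distances = np.array([3.8758436, 4.84770851, 6.39940682,
                              12.30039605, 32.42801258])
distance_threshold = 10

# Each merge at or above the threshold is skipped and adds one cluster
n_clusters = int(np.count_nonzero(largest_distances >= distance_threshold)) + 1
print("Number of clusters:", n_clusters)
```

Two distances exceed the threshold of 10, which gives the 3 clusters reported for the first case.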

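Cluster labels are stored in an ndarray of shape (n_samples,). The output lists the unique labels for the agglomerative, K-means, and DBSCAN models, in that order. DBSCAN assigns the label -1 to samples it considers noise.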

In [ ]:
for agg in agglom:
    print(np.unique(agg.labels_))

for kmean in kmean_array:
    print(np.unique(kmean.labels_))

for DBscan in DBSCAN_array:
    print(np.unique(DBscan.labels_))
[0 1 2]
[0 1]
[0 1 2]
[0 1 2]
[0 1 2]
[0 1 2]
[-1  0  1  2]
[-1  0  1]
[0 1]
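
The label -1 in the DBSCAN output marks noise samples rather than a cluster, so it should be left out when counting clusters. The sketch below shows this on five hand-picked points (the array `X_toy` and the parameter values are assumptions for illustration): four points lie close together and one lies far away.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four neighbouring points and one isolated point (illustrative data)
X_toy = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [5.0, 5.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit(X_toy).labels_

# Count noise samples and real clusters separately
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels) - {-1})
print("Labels:", labels)
print("Noise samples:", n_noise, "Clusters:", n_clusters)
```

The same two lines can be applied to the labels_ of each fitted DBSCAN model to report how many Iris samples were treated as noise.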

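Number of leaves in the hierarchical tree, which equals the number of samples. K-means and DBSCAN do not build a tree, so for those models the number of input features (n_features_in_) is printed instead.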

In [ ]:
for agg in agglom:
    print(agg.n_leaves_)

for kmean in kmean_array:
    print(kmean.n_features_in_)

for DBscan in DBSCAN_array:
    print(DBscan.n_features_in_)
150
150
150
4
2
2
4
2
2

Print the merge distances of the agglomerative models (distances_), the inertia of the K-means models (inertia_), and the core samples found by DBSCAN (components_).

In [ ]:
for agg in agglom:
    print(agg.distances_)

for kmean in kmean_array:
    print(kmean.inertia_)

for DBscan in DBSCAN_array:
    print(DBscan.components_)
[ 0.          0.          0.          0.1         0.1         0.1
  0.1         0.14142136  0.14142136  0.14142136  0.14142136  0.14142136
  0.14142136  0.14142136  0.14142136  0.14142136  0.14142136  0.14142136
  0.14142136  0.14142136  0.14142136  0.17320508  0.17320508  0.17320508
  0.17320508  0.17320508  0.17320508  0.18257419  0.18257419  0.2
  0.2         0.2         0.2         0.21602469  0.21602469  0.2236068
  0.2236068   0.24494897  0.24494897  0.24494897  0.24494897  0.24494897
  0.24494897  0.24494897  0.25819889  0.26457513  0.26457513  0.26457513
  0.26457513  0.26457513  0.27080128  0.28284271  0.28982753  0.29439203
  0.29439203  0.29439203  0.29439203  0.30550505  0.31358146  0.31622777
  0.32145503  0.33166248  0.33166248  0.33166248  0.33665016  0.34156503
  0.34641016  0.34641016  0.34778209  0.35118846  0.35182066  0.35355339
  0.35823642  0.36055513  0.36285902  0.36968455  0.37416574  0.37416574
  0.4         0.41231056  0.41231056  0.41472883  0.42229532  0.43969687
  0.43969687  0.44521263  0.4472136   0.45825757  0.46547467  0.46726153
  0.48131764  0.48166378  0.48989795  0.51478151  0.52915026  0.5329165
  0.53851648  0.54772256  0.57445626  0.58022984  0.59441848  0.60580525
  0.61373175  0.6244998   0.63087241  0.6363961   0.64031242  0.64291005
  0.66269651  0.70945989  0.72456884  0.73257537  0.73409052  0.73496032
  0.75535128  0.76321688  0.80622577  0.82187359  0.82613558  0.83740671
  0.85556999  0.85780728  0.86458082  0.92870878  0.92915732  1.00534287
  1.04705937  1.10513951  1.15325626  1.21700908  1.29839645  1.3048627
  1.39697997  1.41139074  1.49547731  1.5977663   1.75916647  1.76044502
  1.84584475  1.86837719  1.91608028  2.05363058  2.81393883  2.86941764
  3.8758436   4.84770851  6.39940682 12.30039605 32.42801258]
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.11547005  0.11547005  0.11547005  0.11547005  0.11547005  0.11547005
  0.11547005  0.11547005  0.11547005  0.11547005  0.12247449  0.12909944
  0.12909944  0.12909944  0.12909944  0.12909944  0.12909944  0.14142136
  0.14142136  0.14142136  0.14142136  0.15705625  0.15705625  0.15811388
  0.15811388  0.15811388  0.16329932  0.16329932  0.17320508  0.17320508
  0.17320508  0.18257419  0.2         0.2         0.2         0.2081666
  0.22060523  0.2236068   0.2236068   0.2236068   0.2236068   0.23094011
  0.23804761  0.23804761  0.23804761  0.27602622  0.27877282  0.28284271
  0.28867513  0.29154759  0.29439203  0.29800927  0.31411251  0.31622777
  0.31622777  0.35962944  0.36055513  0.39809068  0.43011626  0.43011626
  0.43461349  0.43969687  0.44409021  0.45680047  0.46097722  0.50347995
  0.53229065  0.54416092  0.55901699  0.57264481  0.58309519  0.6041523
  0.6611678   0.71867934  0.72071393  0.73029674  0.7771203   0.82121314
  0.84327404  0.93340774  0.95713693  1.08639311  1.14041903  1.22789863
  1.4664443   1.62484125  1.75657495  2.03829469  2.05304652  2.42712703
  2.58742277  4.43879833  4.68991092  6.09685802 11.68678645]
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.11547005  0.11547005  0.11547005  0.11547005
  0.11547005  0.12247449  0.12247449  0.12649111  0.12909944  0.12909944
  0.12909944  0.12909944  0.12909944  0.12909944  0.12909944  0.14142136
  0.14142136  0.14142136  0.14142136  0.14142136  0.15491933  0.16329932
  0.16329932  0.16329932  0.16832508  0.16832508  0.17320508  0.17320508
  0.18257419  0.18257419  0.18257419  0.18898224  0.2         0.2
  0.2081666   0.21291626  0.21984843  0.21984843  0.2236068   0.23094011
  0.2319688   0.23804761  0.23804761  0.23804761  0.23817488  0.24494897
  0.26140645  0.26992062  0.27386128  0.28867513  0.30934411  0.31937439
  0.35118846  0.35118846  0.35237291  0.36055513  0.36514837  0.36968455
  0.37638633  0.41231056  0.43602891  0.49244289  0.49665548  0.50205925
  0.518698    0.54693954  0.56082422  0.56441173  0.62948392  0.64888038
  0.69321386  0.69863813  0.72341781  0.7787623   0.91215932  0.98961272
  1.15368134  1.16045968  1.35549637  1.66704658  1.75080719  1.78728978
  2.40933206  3.56917392  4.67443318 10.52767915 30.43899692]
141.15417813388655
103.81453420646659
18.046983891906272
[[-0.90068117  1.03205722 -1.3412724  -1.31297673]
 [-1.14301691 -0.1249576  -1.3412724  -1.31297673]
 [-1.38535265  0.33784833 -1.39813811 -1.31297673]
 [-1.50652052  0.10644536 -1.2844067  -1.31297673]
 [-1.02184904  1.26346019 -1.3412724  -1.31297673]
 [-1.02184904  0.80065426 -1.2844067  -1.31297673]
 [-1.14301691  0.10644536 -1.2844067  -1.4444497 ]
 [-1.26418478  0.80065426 -1.227541   -1.31297673]
 [-1.26418478 -0.1249576  -1.3412724  -1.4444497 ]
 [-0.90068117  1.03205722 -1.3412724  -1.18150376]
 [-0.90068117  1.72626612 -1.2844067  -1.18150376]
 [-0.53717756  0.80065426 -1.17067529 -1.31297673]
 [-0.90068117  1.49486315 -1.2844067  -1.05003079]
 [-1.26418478  0.80065426 -1.05694388 -1.31297673]
 [-1.02184904 -0.1249576  -1.227541   -1.31297673]
 [-1.02184904  0.80065426 -1.227541   -1.05003079]
 [-0.7795133   1.03205722 -1.2844067  -1.31297673]
 [-0.7795133   0.80065426 -1.3412724  -1.31297673]
 [-1.38535265  0.33784833 -1.227541   -1.31297673]
 [-1.26418478  0.10644536 -1.227541   -1.31297673]
 [-0.53717756  0.80065426 -1.2844067  -1.05003079]
 [-1.14301691  0.10644536 -1.2844067  -1.4444497 ]
 [-1.02184904  0.33784833 -1.45500381 -1.31297673]
 [-0.41600969  1.03205722 -1.39813811 -1.31297673]
 [-1.14301691  0.10644536 -1.2844067  -1.4444497 ]
 [-0.90068117  0.80065426 -1.2844067  -1.31297673]
 [-1.02184904  1.03205722 -1.39813811 -1.18150376]
 [-1.74885626  0.33784833 -1.39813811 -1.31297673]
 [-0.90068117  1.72626612 -1.05694388 -1.05003079]
 [-1.26418478 -0.1249576  -1.3412724  -1.18150376]
 [-0.90068117  1.72626612 -1.227541   -1.31297673]
 [-1.50652052  0.33784833 -1.3412724  -1.31297673]
 [-0.65834543  1.49486315 -1.2844067  -1.31297673]
 [-1.02184904  0.56925129 -1.3412724  -1.31297673]
 [ 1.2803405   0.10644536  0.64902723  0.39617188]
 [ 0.79566902 -0.58776353  0.47843012  0.39617188]
 [-0.17367395 -0.58776353  0.42156442  0.13322594]
 [ 0.91683689 -0.35636057  0.47843012  0.13322594]
 [ 0.06866179 -0.1249576   0.25096731  0.39617188]
 [ 0.31099753 -0.35636057  0.53529583  0.26469891]
 [-0.29484182 -0.35636057 -0.09022692  0.13322594]
 [-0.29484182 -0.1249576   0.42156442  0.39617188]
 [-0.29484182 -1.28197243  0.08037019 -0.12972   ]
 [ 0.67450115 -0.35636057  0.30783301  0.13322594]
 [ 0.91683689 -0.1249576   0.36469871  0.26469891]
 [ 0.18982966 -0.35636057  0.42156442  0.39617188]
 [-0.17367395 -1.05056946 -0.14709262 -0.26119297]
 [-0.41600969 -1.51337539  0.02350449 -0.12972   ]
 [-0.05250608 -0.8191665   0.08037019  0.00175297]
 [ 1.03800476  0.10644536  0.53529583  0.39617188]
 [-0.29484182 -0.1249576   0.1941016   0.13322594]
 [-0.41600969 -1.05056946  0.36469871  0.00175297]
 [ 0.31099753 -0.1249576   0.47843012  0.26469891]
 [-0.05250608 -1.05056946  0.1372359   0.00175297]
 [-0.29484182 -0.8191665   0.25096731  0.13322594]
 [-0.17367395 -0.1249576   0.25096731  0.00175297]
 [-0.17367395 -0.35636057  0.25096731  0.13322594]
 [ 0.4321654  -0.35636057  0.30783301  0.13322594]
 [-0.17367395 -0.58776353  0.1941016   0.13322594]
 [ 0.79566902 -0.1249576   1.16081857  1.31648267]
 [ 1.15917263 -0.1249576   0.99022146  1.1850097 ]
 [ 0.79566902 -0.1249576   0.99022146  0.79059079]
 [ 1.2803405   0.33784833  1.10395287  1.44795564]
 [ 1.2803405   0.10644536  0.93335575  1.1850097 ]
 [ 1.03800476  0.10644536  1.04708716  1.57942861]
 [ 1.2803405   0.10644536  0.76275864  1.44795564]
 [ 1.15917263  0.33784833  1.21768427  1.44795564]
 [ 1.03800476 -0.1249576   0.81962435  1.44795564]
 [ 0.79566902 -0.1249576   0.81962435  1.05353673]]
[[-0.90068117  1.03205722]
 [-1.14301691 -0.1249576 ]
 [-1.38535265  0.33784833]
 [-1.50652052  0.10644536]
 [-1.02184904  1.26346019]
 [-0.53717756  1.95766909]
 [-1.02184904  0.80065426]
 [-1.14301691  0.10644536]
 [-0.53717756  1.49486315]
 [-1.26418478  0.80065426]
 [-1.26418478 -0.1249576 ]
 [-0.53717756  1.95766909]
 [-0.90068117  1.03205722]
 [-0.90068117  1.72626612]
 [-0.53717756  0.80065426]
 [-0.90068117  1.49486315]
 [-0.90068117  0.56925129]
 [-1.26418478  0.80065426]
 [-1.02184904 -0.1249576 ]
 [-1.02184904  0.80065426]
 [-0.7795133   1.03205722]
 [-0.7795133   0.80065426]
 [-1.38535265  0.33784833]
 [-1.26418478  0.10644536]
 [-0.53717756  0.80065426]
 [-1.14301691  0.10644536]
 [-1.02184904  0.33784833]
 [-0.41600969  1.03205722]
 [-1.14301691  0.10644536]
 [-0.90068117  0.80065426]
 [-1.02184904  1.03205722]
 [-1.74885626  0.33784833]
 [-1.02184904  1.03205722]
 [-0.90068117  1.72626612]
 [-1.26418478 -0.1249576 ]
 [-0.90068117  1.72626612]
 [-1.50652052  0.33784833]
 [-0.65834543  1.49486315]
 [-1.02184904  0.56925129]
 [ 1.40150837  0.33784833]
 [ 0.67450115  0.33784833]
 [ 1.2803405   0.10644536]
 [ 0.79566902 -0.58776353]
 [-0.17367395 -0.58776353]
 [ 0.55333328  0.56925129]
 [ 0.91683689 -0.35636057]
 [ 0.06866179 -0.1249576 ]
 [ 0.31099753 -0.35636057]
 [-0.29484182 -0.35636057]
 [ 1.03800476  0.10644536]
 [-0.29484182 -0.1249576 ]
 [-0.05250608 -0.8191665 ]
 [-0.29484182 -1.28197243]
 [ 0.31099753 -0.58776353]
 [ 0.31099753 -0.58776353]
 [ 0.67450115 -0.35636057]
 [ 0.91683689 -0.1249576 ]
 [ 1.03800476 -0.1249576 ]
 [ 0.18982966 -0.35636057]
 [-0.17367395 -1.05056946]
 [-0.41600969 -1.51337539]
 [-0.41600969 -1.51337539]
 [-0.05250608 -0.8191665 ]
 [ 0.18982966 -0.8191665 ]
 [-0.53717756 -0.1249576 ]
 [ 0.18982966  0.80065426]
 [ 1.03800476  0.10644536]
 [-0.29484182 -0.1249576 ]
 [-0.41600969 -1.28197243]
 [-0.41600969 -1.05056946]
 [ 0.31099753 -0.1249576 ]
 [-0.05250608 -1.05056946]
 [-0.29484182 -0.8191665 ]
 [-0.17367395 -0.1249576 ]
 [-0.17367395 -0.35636057]
 [ 0.4321654  -0.35636057]
 [-0.17367395 -0.58776353]
 [ 0.55333328  0.56925129]
 [-0.05250608 -0.8191665 ]
 [ 1.52267624 -0.1249576 ]
 [ 0.55333328 -0.35636057]
 [ 0.79566902 -0.1249576 ]
 [ 1.76501198 -0.35636057]
 [ 0.79566902  0.33784833]
 [ 0.67450115 -0.8191665 ]
 [ 1.15917263 -0.1249576 ]
 [-0.17367395 -1.28197243]
 [-0.05250608 -0.58776353]
 [ 0.67450115  0.33784833]
 [ 0.79566902 -0.1249576 ]
 [ 1.2803405   0.33784833]
 [-0.29484182 -0.58776353]
 [ 0.55333328 -0.8191665 ]
 [ 1.03800476  0.56925129]
 [ 1.64384411  0.33784833]
 [ 0.4321654  -0.58776353]
 [ 0.31099753 -0.1249576 ]
 [ 0.67450115 -0.58776353]
 [ 1.64384411 -0.1249576 ]
 [ 0.67450115 -0.58776353]
 [ 0.55333328 -0.58776353]
 [ 0.31099753 -1.05056946]
 [ 0.55333328  0.80065426]
 [ 0.67450115  0.10644536]
 [ 0.18982966 -0.1249576 ]
 [ 1.2803405   0.10644536]
 [ 1.03800476  0.10644536]
 [ 1.2803405   0.10644536]
 [-0.05250608 -0.8191665 ]
 [ 1.15917263  0.33784833]
 [ 1.03800476  0.56925129]
 [ 1.03800476 -0.1249576 ]
 [ 0.79566902 -0.1249576 ]
 [ 0.4321654   0.80065426]
 [ 0.06866179 -0.1249576 ]]
[[-1.3412724  -1.31297673]
 [-1.3412724  -1.31297673]
 [-1.39813811 -1.31297673]
 [-1.2844067  -1.31297673]
 [-1.3412724  -1.31297673]
 [-1.17067529 -1.05003079]
 [-1.3412724  -1.18150376]
 [-1.2844067  -1.31297673]
 [-1.3412724  -1.31297673]
 [-1.2844067  -1.4444497 ]
 [-1.2844067  -1.31297673]
 [-1.227541   -1.31297673]
 [-1.3412724  -1.4444497 ]
 [-1.51186952 -1.4444497 ]
 [-1.45500381 -1.31297673]
 [-1.2844067  -1.05003079]
 [-1.39813811 -1.05003079]
 [-1.3412724  -1.18150376]
 [-1.17067529 -1.18150376]
 [-1.2844067  -1.18150376]
 [-1.17067529 -1.31297673]
 [-1.2844067  -1.05003079]
 [-1.56873522 -1.31297673]
 [-1.17067529 -0.91855782]
 [-1.05694388 -1.31297673]
 [-1.227541   -1.31297673]
 [-1.227541   -1.05003079]
 [-1.2844067  -1.31297673]
 [-1.3412724  -1.31297673]
 [-1.227541   -1.31297673]
 [-1.227541   -1.31297673]
 [-1.2844067  -1.05003079]
 [-1.2844067  -1.4444497 ]
 [-1.3412724  -1.31297673]
 [-1.2844067  -1.4444497 ]
 [-1.45500381 -1.31297673]
 [-1.39813811 -1.31297673]
 [-1.2844067  -1.4444497 ]
 [-1.39813811 -1.31297673]
 [-1.2844067  -1.31297673]
 [-1.39813811 -1.18150376]
 [-1.39813811 -1.18150376]
 [-1.39813811 -1.31297673]
 [-1.227541   -0.78708485]
 [-1.05694388 -1.05003079]
 [-1.3412724  -1.18150376]
 [-1.227541   -1.31297673]
 [-1.3412724  -1.31297673]
 [-1.2844067  -1.31297673]
 [-1.3412724  -1.31297673]
 [ 0.53529583  0.26469891]
 [ 0.42156442  0.39617188]
 [ 0.64902723  0.39617188]
 [ 0.1372359   0.13322594]
 [ 0.47843012  0.39617188]
 [ 0.42156442  0.13322594]
 [ 0.53529583  0.52764485]
 [-0.26082403 -0.26119297]
 [ 0.47843012  0.13322594]
 [ 0.08037019  0.26469891]
 [-0.14709262 -0.26119297]
 [ 0.25096731  0.39617188]
 [ 0.1372359  -0.26119297]
 [ 0.53529583  0.26469891]
 [-0.09022692  0.13322594]
 [ 0.36469871  0.26469891]
 [ 0.42156442  0.39617188]
 [ 0.1941016  -0.26119297]
 [ 0.42156442  0.39617188]
 [ 0.08037019 -0.12972   ]
 [ 0.59216153  0.79059079]
 [ 0.1372359   0.13322594]
 [ 0.64902723  0.39617188]
 [ 0.53529583  0.00175297]
 [ 0.30783301  0.13322594]
 [ 0.36469871  0.26469891]
 [ 0.59216153  0.26469891]
 [ 0.70589294  0.65911782]
 [ 0.42156442  0.39617188]
 [-0.14709262 -0.26119297]
 [ 0.02350449 -0.12972   ]
 [-0.03336121 -0.26119297]
 [ 0.08037019  0.00175297]
 [ 0.76275864  0.52764485]
 [ 0.42156442  0.39617188]
 [ 0.42156442  0.52764485]
 [ 0.53529583  0.39617188]
 [ 0.36469871  0.13322594]
 [ 0.1941016   0.13322594]
 [ 0.1372359   0.13322594]
 [ 0.36469871  0.00175297]
 [ 0.47843012  0.26469891]
 [ 0.1372359   0.00175297]
 [-0.26082403 -0.26119297]
 [ 0.25096731  0.13322594]
 [ 0.25096731  0.00175297]
 [ 0.25096731  0.13322594]
 [ 0.30783301  0.13322594]
 [-0.43142114 -0.12972   ]
 [ 0.1941016   0.13322594]
 [ 1.27454998  1.71090158]
 [ 0.76275864  0.92206376]
 [ 1.21768427  1.1850097 ]
 [ 1.04708716  0.79059079]
 [ 1.16081857  1.31648267]
 [ 1.6157442   1.1850097 ]
 [ 0.42156442  0.65911782]
 [ 1.44514709  0.79059079]
 [ 1.16081857  0.79059079]
 [ 1.33141568  1.71090158]
 [ 0.76275864  1.05353673]
 [ 0.87649005  0.92206376]
 [ 0.99022146  1.1850097 ]
 [ 0.70589294  1.05353673]
 [ 0.76275864  1.57942861]
 [ 0.87649005  1.44795564]
 [ 0.99022146  0.79059079]
 [ 1.67260991  1.31648267]
 [ 0.70589294  0.39617188]
 [ 1.10395287  1.44795564]
 [ 0.64902723  1.05353673]
 [ 1.67260991  1.05353673]
 [ 0.64902723  0.79059079]
 [ 1.10395287  1.1850097 ]
 [ 1.27454998  0.79059079]
 [ 0.59216153  0.79059079]
 [ 0.64902723  0.79059079]
 [ 1.04708716  1.1850097 ]
 [ 1.16081857  0.52764485]
 [ 1.33141568  0.92206376]
 [ 1.50201279  1.05353673]
 [ 1.04708716  1.31648267]
 [ 0.76275864  0.39617188]
 [ 1.04708716  0.26469891]
 [ 1.33141568  1.44795564]
 [ 1.04708716  1.57942861]
 [ 0.99022146  0.79059079]
 [ 0.59216153  0.79059079]
 [ 0.93335575  1.1850097 ]
 [ 1.04708716  1.57942861]
 [ 0.76275864  1.44795564]
 [ 0.76275864  0.92206376]
 [ 1.21768427  1.44795564]
 [ 1.10395287  1.71090158]
 [ 0.81962435  1.44795564]
 [ 0.70589294  0.92206376]
 [ 0.81962435  1.05353673]
 [ 0.93335575  1.44795564]
 [ 0.76275864  0.79059079]]
In [ ]:
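# Histogram of the merge distances of each agglomerative model
for agg in agglom:
    plt.hist(agg.distances_, 30)
    plt.show()

# Note: inertia_ is one scalar per k-means model,
# so each of these histograms contains a single bar
for kmean in kmean_array:
    plt.hist(kmean.inertia_, 30)
    plt.show()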
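
The k-means inertia is the sum of squared distances from each sample to its nearest cluster centre. The sketch below recomputes that quantity with NumPy on a few made-up points. The arrays `points` and `centres` and the helper `inertia` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def inertia(points, centres):
    """Sum of squared distances from each point to its nearest centre."""
    # Pairwise squared distances, shape (n_points, n_centres)
    diff = points[:, np.newaxis, :] - centres[np.newaxis, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Each point contributes the squared distance to its closest centre
    return sq_dist.min(axis=1).sum()

# Made-up toy data: two tight groups and one centre per group
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centres = np.array([[0.0, 1.0], [10.0, 1.0]])

print(inertia(points, centres))
```

Comparing this value across the models in `kmean_array` (for example in a bar chart) is likely to be more informative than a histogram per model.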

Plot hierarchical clustering dendrogram.

The code below is adapted from https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py

In [ ]:
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

for agg in agglom:
    plt.title('Hierarchical Clustering Dendrogram')
    # plot the top three levels of the dendrogram
    plot_dendrogram(agg, truncate_mode='level', p=3)
    plt.xlabel("Number of points in node (or index of point if no parenthesis).")
    plt.show()
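
The `counts` column built in `plot_dendrogram` can be checked by hand on a tiny merge tree. In the sketch below, `children` is a made-up array in the same layout as `model.children_`: indices below `n_samples` are leaves, and index `n_samples + k` refers to the cluster created by merge `k`. The helper name `count_leaves` is an assumption for illustration.

```python
import numpy as np

def count_leaves(children, n_samples):
    """Return the number of original samples under each merge node."""
    counts = np.zeros(children.shape[0])
    for i, merge in enumerate(children):
        for child_idx in merge:
            if child_idx < n_samples:
                counts[i] += 1  # leaf node
            else:
                # Earlier merge: reuse the count already stored for it
                counts[i] += counts[child_idx - n_samples]
    return counts

# Made-up merge tree for 4 samples:
# merge 0 joins leaves 0 and 1 -> node 4
# merge 1 joins leaves 2 and 3 -> node 5
# merge 2 joins nodes 4 and 5  -> node 6 (the root)
children = np.array([[0, 1], [2, 3], [4, 5]])
counts = count_leaves(children, n_samples=4)
print(counts)
```

The last entry equals the total number of samples, because the final merge is the root of the tree.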

Inference¶

For each data point, add the assigned cluster from every clustering attempt as a new column of the original Iris data set.

In [ ]:
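# Agglomerative models fill the first 'Cluster <i>' columns
for i, agg in enumerate(agglom):
    df_iris[f'Cluster {i}'] = agg.labels_.astype(str)
    df_iris[f'Cluster {i}'] = 'Cluster ' + df_iris[f'Cluster {i}']

# k-means models continue the numbering (i still holds the last index used)
for kmean in kmean_array:
    i += 1
    df_iris[f'Cluster {i}'] = kmean.labels_.astype(str)
    df_iris[f'Cluster {i}'] = 'Cluster ' + df_iris[f'Cluster {i}']

# DBSCAN models come last; the label -1 marks noise points
for DBscan in DBSCAN_array:
    i += 1
    df_iris[f'Cluster {i}'] = DBscan.labels_.astype(str)
    df_iris[f'Cluster {i}'] = 'Cluster ' + df_iris[f'Cluster {i}']

display(df_iris.head(5))

# Index of the last clustering attempt, used by the loops further down
clusterattampts = i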
   Sepal Length  Sepal Width  Petal Length  Petal Width      Species  Cluster 0  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6  Cluster 7  Cluster 8
0           5.1          3.5           1.4          0.2  Iris-setosa  Cluster 1  Cluster 1  Cluster 1  Cluster 0  Cluster 0  Cluster 2  Cluster 0  Cluster 0  Cluster 0
1           4.9          3.0           1.4          0.2  Iris-setosa  Cluster 1  Cluster 1  Cluster 1  Cluster 0  Cluster 0  Cluster 2  Cluster 0  Cluster 0  Cluster 0
2           4.7          3.2           1.3          0.2  Iris-setosa  Cluster 1  Cluster 1  Cluster 1  Cluster 0  Cluster 0  Cluster 2  Cluster 0  Cluster 0  Cluster 0
3           4.6          3.1           1.5          0.2  Iris-setosa  Cluster 1  Cluster 1  Cluster 1  Cluster 0  Cluster 0  Cluster 2  Cluster 0  Cluster 0  Cluster 0
4           5.0          3.6           1.4          0.2  Iris-setosa  Cluster 1  Cluster 1  Cluster 1  Cluster 0  Cluster 0  Cluster 2  Cluster 0  Cluster 0  Cluster 0
In [ ]:
display(df_iris.tail(5))
     Sepal Length  Sepal Width  Petal Length  Petal Width         Species  Cluster 0  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5   Cluster 6  Cluster 7  Cluster 8
145           6.7          3.0           5.2          2.3  Iris-virginica  Cluster 2  Cluster 0  Cluster 0  Cluster 2  Cluster 2  Cluster 1   Cluster 2  Cluster 1  Cluster 1
146           6.3          2.5           5.0          1.9  Iris-virginica  Cluster 0  Cluster 0  Cluster 0  Cluster 1  Cluster 1  Cluster 1  Cluster -1  Cluster 1  Cluster 1
147           6.5          3.0           5.2          2.0  Iris-virginica  Cluster 2  Cluster 0  Cluster 0  Cluster 2  Cluster 2  Cluster 1   Cluster 2  Cluster 1  Cluster 1
148           6.2          3.4           5.4          2.3  Iris-virginica  Cluster 2  Cluster 0  Cluster 0  Cluster 2  Cluster 2  Cluster 1  Cluster -1  Cluster 1  Cluster 1
149           5.9          3.0           5.1          1.8  Iris-virginica  Cluster 0  Cluster 0  Cluster 0  Cluster 1  Cluster 1  Cluster 1  Cluster -1  Cluster 1  Cluster 1
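
The labelling cell relies on the loop variable `i` surviving from the first `for` loop into the next two. A possible alternative, sketched here with made-up label arrays instead of fitted models, numbers all label arrays in a single `enumerate`. The names `add_cluster_columns`, `label_arrays` and `df_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

def add_cluster_columns(df, label_arrays):
    """Add one 'Cluster <i>' column per label array; return the last index used."""
    last = -1
    for last, labels in enumerate(label_arrays):
        as_text = pd.Series(labels, index=df.index).astype(str)
        df[f'Cluster {last}'] = 'Cluster ' + as_text
    return last

# Made-up labels standing in for agg.labels_, kmean.labels_ and DBscan.labels_
df_demo = pd.DataFrame({'Species': ['a', 'a', 'b', 'b']})
label_arrays = [
    np.array([0, 0, 1, 1]),
    np.array([1, 1, 0, 0]),
    np.array([0, -1, 1, 1]),  # -1 mimics a DBSCAN noise point
]
last_index = add_cluster_columns(df_demo, label_arrays)
print(df_demo)
```

In the notebook, the list could be built as `[m.labels_ for m in [*agglom, *kmean_array, *DBSCAN_array]]`, assuming those three collections hold the fitted models in that order.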

Plot pairwise relationships per species and per cluster using method seaborn.pairplot.

In [ ]:
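# Pairwise relationships coloured by the true species.
# Note: sns.pairplot creates its own figure, so plt.figure only adds an empty one.
plt.figure(figsize=(8,8))
ax = sns.pairplot(df_iris[["Sepal Length","Sepal Width","Petal Length","Petal Width","Species"]], hue="Species")
plt.show()
# Pairwise relationships coloured by the result of each clustering attempt
for i in range(0, clusterattampts + 1):
    plt.figure(figsize=(8,8))
    ax = sns.pairplot(df_iris[["Sepal Length","Sepal Width","Petal Length","Petal Width", F'Cluster {i}']], hue=F'Cluster {i}')
    plt.show()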

Evaluation¶

If clustering is successful, one may expect that flowers of the same species end up in the same cluster. Let us check whether this is the case.

Code for the bar plot is adapted from https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py

In [ ]:
# Define labels for species
species = df_iris['Species'].unique()

for i in range(0, clusterattampts + 1):
    # Define labels for clusters
    clusters = df_iris[F'Cluster {i}'].unique()

    # Sort cluster names in alphabetical order, i.e.,
    # Cluster 0, Cluster 1, Cluster 2, etc.
    clusters.sort()

    # Determine the location for cluster labels 
    x = np.arange(len(clusters))

    # Define the width of the bars
    width = 0.25

    # Create the bar plot
    fig, ax = plt.subplots()
    offset = -width
    for spec in species:
        nr_occurrences = []
        for clus in clusters:
            nr = df_iris[(df_iris['Species']==spec) & (df_iris[F'Cluster {i}']==clus)][F'Cluster {i}'].count()
            nr_occurrences.append(nr)
        rects = ax.bar(x + offset, nr_occurrences, width, label=spec)
        offset = offset + width

    # Add text for labels, title and custom x-axis tick labels, etc.
    ax.set_ylabel('Number of occurrences')
    ax.set_title(f'Number of occurrences in cluster using method: {i}')
    ax.set_xticks(x)
    ax.set_xticklabels(clusters)
    ax.legend()

    fig.tight_layout()
    plt.show()
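
The counts behind each bar plot can also be read as a table. A minimal sketch using `pandas.crosstab` on a small made-up frame follows. The frame `df_demo` and its values are illustrative only; in the notebook the same call would take `df_iris['Species']` and one of the `Cluster {i}` columns.

```python
import pandas as pd

# Made-up data: six flowers, their species and an assigned cluster
df_demo = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica'],
    'Cluster 0': ['Cluster 1', 'Cluster 1', 'Cluster 0', 'Cluster 0', 'Cluster 0', 'Cluster 2'],
})

# Rows are species, columns are clusters, cells are sample counts
table = pd.crosstab(df_demo['Species'], df_demo['Cluster 0'])
print(table)
```

A clustering that matches the species well shows one dominant cell per row and per column.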

After agglomerative clustering with all three configurations (0, 1, 2), option 2, which uses 'Petal Length' and 'Petal Width', seems to do best.

With k-means (3, 4, 5), however, it does not seem to matter whether 'Petal Length' and 'Petal Width' or 'Sepal Length' and 'Sepal Width' are used. The results are close enough to each other to say that they perform about the same.

In [ ]:
print(df_iris['Species'].value_counts())
for i in range(0, clusterattampts + 1):
    print(F'\nusing method: {i}:')
    print(df_iris[F'Cluster {i}'].value_counts())

    species = df_iris['Species'].unique()
    for spec in species:
        print('Number of samples per cluster for',spec)
        print(df_iris[df_iris['Species']==spec][F'Cluster {i}'].value_counts())
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

using method: 0:
Cluster 0
Cluster 0    64
Cluster 1    50
Cluster 2    36
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 0
Cluster 1    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 0
Cluster 0    49
Cluster 2     1
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 0
Cluster 2    35
Cluster 0    15
Name: count, dtype: int64

using method: 1:
Cluster 1
Cluster 0    94
Cluster 1    56
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 1
Cluster 1    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 1
Cluster 0    45
Cluster 1     5
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 1
Cluster 0    49
Cluster 1     1
Name: count, dtype: int64

using method: 2:
Cluster 2
Cluster 0    54
Cluster 1    50
Cluster 2    46
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 2
Cluster 1    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 2
Cluster 2    45
Cluster 0     5
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 2
Cluster 0    49
Cluster 2     1
Name: count, dtype: int64

using method: 3:
Cluster 3
Cluster 1    56
Cluster 0    50
Cluster 2    44
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 3
Cluster 0    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 3
Cluster 1    39
Cluster 2    11
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 3
Cluster 2    33
Cluster 1    17
Name: count, dtype: int64

using method: 4:
Cluster 4
Cluster 1    57
Cluster 0    51
Cluster 2    42
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 4
Cluster 0    49
Cluster 1     1
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 4
Cluster 1    36
Cluster 2    12
Cluster 0     2
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 4
Cluster 2    30
Cluster 1    20
Name: count, dtype: int64

using method: 5:
Cluster 5
Cluster 0    52
Cluster 2    50
Cluster 1    48
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 5
Cluster 2    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 5
Cluster 0    48
Cluster 1     2
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 5
Cluster 1    46
Cluster 0     4
Name: count, dtype: int64

using method: 6:
Cluster 6
Cluster -1    57
Cluster 0     40
Cluster 1     38
Cluster 2     15
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 6
Cluster 0     40
Cluster -1    10
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 6
Cluster 1     37
Cluster -1    13
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 6
Cluster -1    34
Cluster 2     15
Cluster 1      1
Name: count, dtype: int64

using method: 7:
Cluster 7
Cluster 1     83
Cluster 0     43
Cluster -1    24
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 7
Cluster 0     43
Cluster -1     7
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 7
Cluster 1     42
Cluster -1     8
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 7
Cluster 1     41
Cluster -1     9
Name: count, dtype: int64

using method: 8:
Cluster 8
Cluster 1    100
Cluster 0     50
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 8
Cluster 0    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 8
Cluster 1    50
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 8
Cluster 1    50
Name: count, dtype: int64
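
To compare methods with a single number, the per-species counts printed above can be condensed into a purity score: for each cluster take the count of its most common species, sum these, and divide by the number of samples. Purity is an addition here, not a metric used in the original notebook, and it does not penalise a method for using more clusters. The tables below copy the printed counts for methods 2, 3 and 5 (rows: setosa, versicolor, virginica; columns: Cluster 0, 1, 2). The names `purity` and `tables` are assumptions.

```python
import numpy as np

def purity(contingency):
    """Fraction of samples that belong to the majority species of their cluster."""
    contingency = np.asarray(contingency)
    # Largest species count per cluster (column), summed over clusters
    return contingency.max(axis=0).sum() / contingency.sum()

# Counts copied from the printed output (species x cluster)
tables = {
    2: [[0, 50, 0], [5, 0, 45], [49, 0, 1]],
    3: [[50, 0, 0], [0, 39, 11], [0, 17, 33]],
    5: [[0, 0, 50], [48, 2, 0], [4, 46, 0]],
}

for method, table in tables.items():
    print(f'method {method}: purity = {purity(table):.3f}')
```

With these counts, methods 2 and 5 both reach a purity of 0.96, while method 3 reaches about 0.81.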